This page contains supporting material for the course “Solving Business Problems with R”.
Here I will present the basics: installing R and RStudio, basic syntax, mathematical operators, objects, classes, and structures.
First, you’re gonna need to install the R software environment here for your operating system.
Click on the link for your OS and install the latest version; for Mac it’s the latest .pkg file at the top of the list (4.0.0).
Then install RStudio, an integrated development environment (IDE). It is where you will spend most of your time with R; it has a welcoming layout and many features that make R more accessible if you’ve only worked with graphical user interface tools for analyzing data, such as Excel.
To install RStudio, see https://rstudio.com/products/rstudio/download/#download for the free desktop version.
Once you’ve installed RStudio you should see something like this.
Let’s start by learning about RStudio, which is an Integrated Development Environment (IDE) for working with R.
The RStudio IDE open-source product is free under the Affero General Public License (AGPL) v3. The RStudio IDE is also available with a commercial license and priority email support from RStudio, Inc.
We will use the RStudio IDE to write code, navigate the files on our computer, inspect the variables we create, and visualize the plots we generate. RStudio can also be used for other things (e.g., version control, developing packages, writing Shiny apps) that we will not cover during the workshop.
One of the advantages of using RStudio is that all the information you need to write code is available in a single window. Additionally, RStudio provides many shortcuts, autocompletion, and highlighting for the major file types you use while developing in R. RStudio makes typing easier and less error-prone.
It is good practice to keep a set of related data, analyses, and text self-contained in a single folder called the working directory. All of the scripts within this folder can then use relative paths to files. Relative paths indicate where inside the project a file is located (as opposed to absolute paths, which point to where a file is on a specific computer). Working this way makes it a lot easier to move your project around on your computer and share it with others without having to directly modify file paths in the individual scripts.
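To make the distinction concrete, here is a minimal sketch; the folder and file names (data/sales.csv and the absolute path) are made-up examples, not files from this course.

```r
# Hypothetical example: suppose your working directory (project folder)
# contains a subfolder "data" with a file "sales.csv".

# An absolute path only works on one specific machine:
# read.csv("C:/Users/me/projects/analysis/data/sales.csv")

# A relative path works for anyone who opens the project folder:
# read.csv("data/sales.csv")

# file.path() builds such paths portably
relative <- file.path("data", "sales.csv")
relative # "data/sales.csv"
```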
RStudio provides a helpful set of tools to do this through its “Projects” interface, which not only creates a working directory for you but also remembers its location (allowing you to quickly navigate to it). The interface also (optionally) preserves custom settings and open files to make it easier to resume work after a break.
To create a new project: under the File menu, click on New Project, choose New Directory, then New Project. Enter a name for the new folder (e.g. ~/data-carpentry) and click Create Project. Inside it you can then create a new script file (e.g. script.R). Let’s take a quick tour of RStudio.
In the panel called Console you can play around with what R can do. One of the most important first steps is to install packages that will be useful and/or necessary for solving the problems in this course and beyond.
To install your first package, paste this into the console field indicated on the screenshot:
install.packages("dplyr")
Once the package has been installed, we need to load it before we can use it:
library(dplyr)
When learning any programming language, your best weapon is the documentation. In RStudio you can view it directly in the Help pane in the lower-right corner. To call up the documentation and find out more about a function or package, simply add a question mark before its name and run it. For example, one of the two functions we’ve used so far is library; let’s find out more about it.
?library
The help page reveals all the useful information about the function: the description, the arguments the function takes, examples of use, related functions, and the package the function comes from. In the case of library, the package is {base}, a related function is require, and the description states that the function ‘loads and attaches packages’. The arguments include, for example, quietly, a logical that controls whether library prints a message about whether attaching the package succeeded. Under ‘Usage’ you can also see the default values of the arguments; for example, quietly has a default of FALSE, so library will print to alert you of a successful attachment unless you explicitly tell it not to:
library(stargazer, quietly = TRUE) # attach stargazer, a package for LaTeX tables
If you prefer to learn through videos, here is a comprehensive intro covering the bases by “How to R”. It explains the interface of RStudio and the basic capabilities and is very accessible.
There is a plethora of resources online that explain the basics well.
If you are someone who learns best by doing, I recommend the swirl package, which is a built in learning experience in the R console. You can start it by installing the package, calling it and finally launching swirl. See the code below. (You can paste it directly into your console!) Once you launch, choose the R Programming course and start with “Basic Building Blocks”.
install.packages("swirl") # install, you only need to do this once
library(swirl) # call
swirl() # launch
If you prefer to read, see Chapter 4 of “R for Data Science” by Hadley Wickham and Garrett Grolemund. Bear in mind that Wickham starts the book off with data visualisation, so the exercise portion assumes that prerequisite.
Lastly, if you’re a viewer, the accessible option is a video by “How to R” which covers all the basics, like using R as a calculator, assigning values and types of objects, well, if slightly superficially. It is accessible, inviting and short.
For the student that wants to go deeper, I recommend a video from Berkeley’s R bootcamp. It covers the same bases as the video above, but expands on them greatly. Contrary to the one above, it includes conceptual commentary, being a part of a lecture. What is particularly beneficial is that it does not brush over vectors and is very thorough with data structures. Unsurprisingly, it is also nearly three times the length of the previous one.
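Whichever resource you pick, the core basics they all cover (R as a calculator, assigning values, object types) fit in a few lines; here is a quick sketch you can paste into your console:

```r
# R as a calculator, with the usual operator precedence
2 + 3 * 4          # 14
10 %% 4            # 2 (remainder of division)

# Assign values to objects with <-
budget <- 1500
sales <- budget * 0.05

# c() builds vectors, R's fundamental data structure
revenues <- c(120, 95, 143)
mean(revenues)     # 119.3333

# Every object has a class
class(budget)      # "numeric"
class("hello")     # "character"
class(TRUE)        # "logical"
```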
“There are no routine statistical questions, only questionable statistical routines.” — Sir David Cox
To analyze any dataset you’re faced with in a business situation you will first need to understand it, and understanding comes through asking the right questions. In the later parts you will gain tools that will let you find answers, but it is of utmost importance to ask the correct questions first. There is no routine that will let us do that with 100% accuracy, but there are a few good resources out there that teach good habits.
There are a few very basic functions that let us explore the data and are omitted in many tutorials. The first thing you must do when getting a dataset is to load it into R, most likely using read.csv(). Here, I am going to use the built-in iris dataset. But for now, I do not know anything about it.
First, I would like to find out what it looks like by taking a look at the first couple of observations and the variables (columns), essentially the “top of the table”.
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Now, I have my very preliminary overview. It seems as though everything is in place: the variables are not confused with observations and the data looks to be parsed correctly. But I still do not know what these variables are and what the observations represent. For that, I will use the str() and summary() functions. The first one stands for structure and gives a compact overview of how many observations the data frame has and, crucially, the number of variables, their full names and their types. summary() returns a table of basic descriptive summaries of the data by variable. See the outputs for the iris dataset below.
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
You might also want to find out the class of your data, because certain operations are reserved for e.g. data frames. To find out, use class().
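For example, on the built-in iris dataset:

```r
class(iris)              # "data.frame", so data-frame operations apply
class(iris$Species)      # "factor"
class(iris$Sepal.Length) # "numeric"
```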
Lastly, you might want to get a good old-fashioned look at the table, just like you did back in your Excel days. To open your data in a new “tab” in RStudio, use View(). If you want to look at just one of the variables, for example Species from the iris dataset, you can use View(iris$Species). To look at specific observations (rows), for example the second and fiftieth ones in the iris dataset, type View(iris[c(2,50),]). Copy and paste the code below into your RStudio to try it out.
View(iris$Species) # open the Species column in a new tab
View(iris[c(2,50),]) # open the 2nd and 50th rows in a new tab
Once again, “R for Data Science” offers a comprehensive explanation of the topic, with examples and exercises. I strongly recommend taking a look at Chapter 7. The chapter is heavy on data visualisation, which has not been covered yet, so a disclaimer: not all the code will make sense, nor should it. Alternatively, you may familiarize yourself with Chapter 3 and be ahead of the curve.
The most important takeaway is understanding the two fundamentals of EDA: variation and covariation.
If you learn better by watching, here is a video on univariate EDA that explores the first fundamental, variation. I recommend watching the videos in this section at 1.25 or 1.5 speed.
The second part on EDA from James Dayhuff deals with multivariate analysis, or as Wickham puts it, covariation.
Alternatively, you can learn EDA directly in RStudio via the swirl package covered above. Enter the code below in your console and follow the prompts! (Fair warning: it includes a lot of visualisation!)
library(swirl)
install_course("Exploratory_Data_Analysis")
swirl()
In this section we will explore how to work with datasets that are not perfect and need some ‘cleaning’ before they can be used for business purposes, like reporting. But even if your data is clean, you’re gonna need to know how to organize it.
You’re gonna need to install a package of packages for that, if you haven’t already. As you might have guessed, it’s the tidyverse, your one-stop shop for getting started with manipulating data. Mind you, there is a raging debate on whether to start with the tidyverse or base R, and I do encourage you to check it out if you want to dig deeper. Both have their pros and cons; here, you are gonna learn the tidyverse right away for faster results, but bear in mind that there is another way.
Okay, now that that’s out of the way, let’s install the tidyverse.
install.packages('tidyverse')
library(tidyverse)
Having done that, see the cheatsheet below of the most important organizing functions and their purpose.
filter() # select only the rows that match your criteria
arrange() # sort the rows by one or more columns (or more complex expressions)
select() # select only the columns you need
mutate() # add new columns, e.g. by making operations on the others
group_by() # group the rows by the values of one or more columns
summarise() # used with group_by to get summaries (mean, sum) for groups
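Here is a minimal sketch of how these verbs combine on the built-in iris dataset, chained with the %>% pipe (the column Petal.Area is a made-up example):

```r
library(dplyr)

# filter rows, keep two columns, sort descending
setosa_petals <- iris %>%
  filter(Species == "setosa") %>%
  select(Species, Petal.Length) %>%
  arrange(desc(Petal.Length))

# add a new column, then summarise it by group
iris %>%
  mutate(Petal.Area = Petal.Length * Petal.Width) %>%
  group_by(Species) %>%
  summarise(mean_area = mean(Petal.Area))
```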
Note that dplyr verbs do not change the original dataset if you use them without assigning the result to a variable, like this:
filter(iris, Species == "setosa") # filter only the flowers of 'setosa' species
We can now look at our iris dataset to see whether it has changed.
View(iris) # to open a tab with the dataset
str(iris$Species) # to see if we dropped any Species from the column
It has clearly remained whole. Therefore, whenever you use dplyr verbs, remember to assign the new data frames to new variables if you’re gonna use them later.
For a comprehensive cheatsheet with more verbs, look to the RStudio’s materials on dplyr, which is the tidyverse package all these functions belong to.
I suggest you bookmark it for future use. I did.
As always, I refer you to “R for Data Science”, the material needed for understanding this section is Chapter 5 on Data Transformation.
The resource section of course also caters to the perpetual YouTube viewer, with Data School’s intro to dplyr, which covers all the bases accessibly. It also comes with a website similar to the one you’re on that makes it easier to follow along. Save time by watching this video at 1.25 times the original speed :-)
It might very well happen that you will need to do some legwork in order to start using your dataset for gathering insights. It is often said that data scientists spend up to 80% of their time cleaning data (Dasu and Johnson, 2003), which is important to keep in mind when practicing data science as well as when managing a team of data scientists. An additional important note: tidying refers to arranging datasets such that each variable is a column and each observation (or case) is a row (Wickham, 2014). Data cleansing or cleaning has many definitions and can refer either to the general process of “transforming raw data into consistent data that can be analyzed” (de Jonge and van der Loo, 2013) or, more specifically, to removing or fixing duplicated, corrupted, or incomplete data (Müller and Freytag, 2003). This section will mostly deal with tidy data and later introduce merging of datasets.
The most commonly encountered problems with data sets include: column headers that are values rather than variable names, multiple variables stored in one column, variables stored in both rows and columns, multiple types of observational units in the same table, and a single observational unit spread across multiple tables.
And many more that are outlined and discussed in an article on tidy data that can be found on CRAN. The article provides guidelines and solutions for frequently encountered problems as well as ready-made code chunks.
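As a small illustration, here is a sketch of one such problem and its fix using tidyr’s pivot_longer(); the table sales_wide is a made-up example in which the column headers q1 and q2 are really values of a “quarter” variable:

```r
library(tidyr)

# Made-up messy table: q1/q2 are values, not variable names
sales_wide <- data.frame(
  branch = c("North", "South"),
  q1 = c(100, 80),
  q2 = c(120, 95)
)

# Gather the quarter columns into a variable of their own
sales_tidy <- pivot_longer(sales_wide, cols = c(q1, q2),
                           names_to = "quarter", values_to = "sales")
sales_tidy # now one row per branch-quarter observation
```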
For an even more detailed and theoretically comprehensive outlook on tidy data look to chapter 12 of the book and do the exercises. Should that not be enough, you can turn to Hadley Wickham’s academic article in the Journal of Statistical software.
The most important functions are captured in the cheatsheet below, which you can download here.
In terms of video content, I’d recommend a pretty specific video on cleaning a sales promotion dataset. It’s straightforward but also quite a narrow example; it’s only 23 minutes and the teacher is engaging, so do give it a try.
It showcases an important feature of this part of a data scientist’s job. Namely, that all tidy datasets are the same, but all messy ones are messy in their own way, so there is a lot of extra non-routine work involved in figuring out exactly how to fix any specific case. That said, we’re programming here, so once you’ve solved it and a new batch of data comes through next month, you can just run the same script on it or even have the machine do it for you every Monday. How cool is that?
Very cool.
Sometimes when you’re doing analysis you might collect your data from multiple sources and then put it together, for example sales data from multiple branches of the business that managers send to the data analysis team every week. The good news is that you can very effectively use the tidyverse for that purpose; the bad news is that it will often require some work if the datasets are not standardized in the same way. Therefore, what you want to do first is ensure that each of them is tidy, and then merge them easily into a master data frame that will inherently also be tidy. In other words, if you add tidy and tidy together, you get tidy. On the other hand, if you add messy and messy, you get a whole new kind of messy that can be very difficult to clean.
For a video introduction to merging, check out the webinar from RStudio itself; I recommend watching the whole thing for good fundamentals of using dplyr and a refresher on some material already covered. The merging part begins at 41:50; see the embed below.
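To sketch the idea in code (the branch reports below are made-up examples): once two datasets are tidy and share the same columns, stacking them is a one-liner, and extra information can be joined in by a key column.

```r
library(dplyr)

# Two hypothetical weekly reports, already tidy and standardized
north <- data.frame(branch = "North", week = 1:2, sales = c(100, 120))
south <- data.frame(branch = "South", week = 1:2, sales = c(80, 95))

# Stack tidy datasets with identical columns into a master data frame
all_sales <- bind_rows(north, south)

# Join extra information (e.g. branch managers) by a key column
managers <- data.frame(branch = c("North", "South"),
                       manager = c("Ann", "Bo"))
left_join(all_sales, managers, by = "branch")
```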
The next topic to cover is the first one that deals more with math than with programming. Cluster analysis is finding subgroups of observations within a data set. Usually, if we plot a few observations based on two variables, we can do this easily by “eyeballing” them, but computers can’t. They don’t know how many clusters to build or how to put them together. If we manage to teach them that, we can exceed the ability of our own eyes, include many more variables, and get much better clusters. Cluster analysis is what computer scientists would call an “unsupervised learning method”, because there is no response variable (“y”); instead of predicting something, the algorithm tries to find similarities in the existing data and group it. Incidentally, cluster analysis is also probably your very first machine learning model, so that’s one buzzword off the bucket list.
On a serious note, as you might already know, the two most popular approaches to algorithmically solving the clustering problem are k-means and hierarchical clustering. In this section I will focus on what is probably the more popular one, k-means. (Seif, 2018)
Now, as I stated before, it is more of a mathematical challenge than a programmatic one. Once you understand both of the methods and can apply one, R offers packages (e.g. cluster) with built-in functions that combine into a ready-made recipe for performing cluster analysis.
If you feel rusty on the theoretical side and would like to code along a more math-heavy example, I recommend this article from the University of Cincinnati (there are other great resources on the same page which are worth checking out).
For fast results and an easy walk-through, check out the YouTube video below. In the beginning it deals with important steps to prepare one’s dataset for clustering, then dives straight into the hierarchical method, and later arrives at k-means around 14:45.
You might recall the iris dataset we talked about above. Let’s try performing clustering on it as a walk-through. First, if your machine doesn’t have the packages, we’d better install them. (PS: we’re also gonna use the tidyverse, but I assume by now it’s all installed.)
install.packages(c("cluster", "factoextra"))
We’ll also load them while we’re at it.
library(tidyverse)
library(cluster)
library(factoextra)
Now we can get to the first step, loading our data, which in this case is built in. We’ll assign it to a variable to store it and perform operations on it.
df <- iris
Now, as you probably remember, iris has 5 variables: the first four are continuous and store information about the widths and lengths of sepals and petals; the fifth is a categorical variable that stores the species. The first step before clustering is often to plot any two continuous variables that we suspect have an illustrative relationship and try to ‘eyeball’ the clusters. In this special case it would not be crazy to assume that the clusters could be the species of irises, so I will add colour to see whether that’s a reasonable assumption. I also add theme_bw, because I can’t stand the grey default of ggplots. Highly recommend.
ggplot(data=iris, mapping = aes(y = Petal.Length, x = Petal.Width)) +
geom_point(mapping = aes(color = Species))+
theme_bw()
Here we see that the Species mostly follow what we’d expect from eyeballing the petal variables. This is of course a subjective exercise, but if I were gonna cluster the flowers with my eyes based only on those two variables and without knowing about the Species, it’d look something like this.
ggplot(data=iris, mapping = aes(y = Petal.Length, x = Petal.Width)) +
geom_point(mapping = aes(color = Species)) +
geom_point(data=iris[iris$Petal.Width<0.75,],
pch=21, fill=NA, size=3, colour="red", stroke=1) +
geom_point(data=iris[iris$Petal.Width>0.75&iris$Petal.Width<1.75,],
pch=21, fill=NA, size=3, colour="green", stroke=1) +
geom_point(data=iris[iris$Petal.Width>1.75,],
pch=21, fill=NA, size=3, colour="blue", stroke=1) +
theme_bw()
As you can see, it’s mostly the same as the Species, apart from some virginica and versicolor tension where some blue points get encircled by the green cluster. Eyeballing is a good exercise to get an idea of the dataset and of how many clusters we expect there to be. Now we want to see how the computer tackles this task based on all 4 continuous variables (we can only effectively eyeball 2 at once). How many clusters will our k-means have and what will they look like? Let’s find out.
First, we need to prepare the data set df we created above for analysis. For that, I will remove the categorical variable, Species, which is the fifth column of our data frame. Then we need to standardize the other variables so that they have a mean of 0 and a standard deviation of 1. The scale function takes care of that by centering and scaling.
df <- scale(df[,-5])
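You can check that the scaling worked; this self-contained sketch recreates df from iris and inspects the column means and standard deviations:

```r
df <- scale(iris[, -5]) # drop Species (column 5), then standardize

round(colMeans(df), 10) # column means: all 0 (up to rounding error)
apply(df, 2, sd)        # column standard deviations: all 1
```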
Now we have to compute the distance between our observations, for which we’ll use the factoextra package. get_dist automatically computes said distance for us, with the default being the standard Euclidean version \(\begin{equation} d_{euc}(x,y) = \sqrt{\sum_{i=1}^n(x_i - y_i)^2}\end{equation}\). That’s the one we will use, but be advised that other distance metrics exist. After that, we will visualise the distances on a heatmap.
distance <- get_dist(df)
fviz_dist(distance,show_labels = FALSE)
This visualisation shows that there are large differences between flowers in this dataset (purple) and also a fair number of flowers that appear to be part of the same cluster (red). I don’t show the labels of the observations because individual flowers have no meaningful names, but doing so is perfectly sound with fewer observations that have meaningful labels, e.g. EU countries.
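To demystify what get_dist computed, here is a base-R sketch checking the Euclidean formula by hand against dist(), which also defaults to Euclidean distance:

```r
df <- scale(iris[, -5]) # standardized data, as above
x <- df[1, ]            # first flower
y <- df[2, ]            # second flower

by_hand <- sqrt(sum((x - y)^2))           # d(x, y) from the formula
built_in <- as.numeric(dist(rbind(x, y))) # base R, Euclidean by default

all.equal(by_hand, built_in) # TRUE
```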
Now that we know clusters exist, we have to determine the number of clusters (k) before using the k-means algorithm. We’ll use what’s called the elbow method to visualise the within-cluster sum of squares, WSS, for different numbers of clusters k.
set.seed(123)
wss <- numeric(10) # placeholder vector for the 10 WSS values
for (i in 1:10) wss[i] <- sum(kmeans(df, centers=i)$withinss)
ggplot(mapping = aes(1:10, wss))+
geom_point()+
geom_line()+
scale_x_continuous(name = "Number of clusters", breaks = seq(0,11,1))+
scale_y_continuous(name = "Within cluster sum of squares")+
theme_bw()
print(wss,digits = 0)
## [1] 596 221 139 114 105 81 73 64 65 59
This is a somewhat complicated set of commands, so let’s go through them one by one. First, we set a seed so that the k-means algorithm will arrive at the same result every time we run this code, because there is some randomness to the WSS value for clusters. Then we compute wss, the within-cluster sum of squares, for values of k from 1 to 10, and plot it against k. Notice that the kmeans function’s argument centers effectively means k, the number of clusters.
Lastly, we look for the “bend” or “elbow” in the data: the point where adding more clusters does not substantially decrease WSS. Here it seems to be 3 or 4. Having the benefit of knowing that there are three Species, I will choose 3 going forward, but it’s close, as you can see both from the plot and from the printout of WSS values.
You can also use the function built into the factoextra package, which does the same thing, but in a less customizable fashion and without letting you know what calculation is going on in the background. Its arguments are the normalized dataset df, the algorithm kmeans, and the method "wss".
fviz_nbclust(df, kmeans, method = "wss")
Now that we’ve determined, using the elbow method, that the appropriate number of clusters is 3, we can see how the algorithm divides the observations based on all four variables. We visualise the results using the function fviz_cluster, which does that neatly for us.
iris_cluster_3 <- kmeans(df,centers = 3, nstart = 25)
fviz_cluster(iris_cluster_3,data=df,ggtheme = theme_bw())
Here we have the clusters visualised over the principal dimensions that contributed the most to variability. Although this looks good, we might be interested in how it compares to our initial plotting of the flowers by petal lengths and widths coloured by Species. It is also interesting to see whether the Species are the clusters.
iris_clustered <- iris %>% mutate(cluster = iris_cluster_3$cluster)
ggplot(data = iris_clustered, aes(Petal.Width,Petal.Length)) +
geom_point(mapping = aes(color = Species)) +
geom_point(data=iris_clustered[iris_clustered$cluster==1,],
pch=21, fill=NA, size=3, colour="red", stroke=1) +
geom_point(data=iris_clustered[iris_clustered$cluster==2,],
pch=21, fill=NA, size=3, colour="green", stroke=1) +
geom_point(data=iris_clustered[iris_clustered$cluster==3,],
pch=21, fill=NA, size=3, colour="blue", stroke=1) +
theme_bw()
Based on this we can see that the algorithm has mostly clustered the flowers correctly into Species, and that setosa is much different from the other two, which have some overlap. Bear in mind that the computer has considered not only the petals, which I visualised here, but also the other two variables in making the clustering decision.
As a last note, we might want to investigate exactly how correctly the flowers were clustered.
table(iris_clustered$Species,iris_clustered$cluster)
##
## 1 2 3
## setosa 0 50 0
## versicolor 11 0 39
## virginica 36 0 14
Now that we have the table, we can calculate the number of correctly clustered flowers and the success ratio. \[ rate_{success} = \frac{successes}{nrow(iris)} = \frac{50+39+36}{150} \approx 83.3\% \] It looks like our clustering was about 83% correct. Of course, most of the time we’re not gonna have such a neat measure of success as with the flowers example, but it illustrates well that clustering is effective and useful. What if the species classification were lost from the data?
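The same calculation in R, using the counts from the confusion table above:

```r
# Best-matching cluster counts per species, read off the table above
correct <- 50 + 39 + 36
success_rate <- correct / nrow(iris)
success_rate # 0.8333333
```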
You’ve probably heard of modeling before; it seems that nowadays everyone is answering every question they have by fitting a model, modeling a relationship, or simply trying to predict sales. Well, the days of being left out of this conversation are over, and we’re now going to start modeling ourselves. We’re gonna start with a detailed walk-through of a simple linear model and then go on to a multiple regression.
The first type of regression we’ll tackle is, well, simple. It is going to model the relationship between two variables: one dependent and one independent, which is also called explanatory or a predictor. So, first we have to determine whether there are any grounds to assume that a relationship between the two variables exists, i.e. that they are correlated. Note that we do not require one to be caused by the other. Although that might be the case, the general rule is:
Correlation does not imply causation!
So let’s say we are interested in whether there is a relationship between sales of a company and the advertising budget, specifically how and if the budget can affect the sales. Hence sales will be our dependent variable and advertising budget will be our independent variable. See the sample regression line: \[\begin{equation} \hat{Sales}_i = \hat{\beta}_0 + \hat{\beta}_1 {Budget}_i + \hat{\epsilon}_i \end{equation}\]
I got the dataset marketing from a package called datarium. It has 4 variables by default: 3 that represent budgets for different advertising media and one for sales. Because we’re doing a simple regression, I will create a new column, budget, that stores the total advertising budget. Then we can inspect whether there are grounds to think that a relationship exists. Finally, I perform a regression using the lm function and store it in a variable marketing_model. Notice that the syntax of lm is lm(dependent ~ independent, data).
library(datarium)
marketing_df <- marketing %>% mutate(budget = rowSums(marketing[,1:3]))
ggplot(marketing_df,aes(budget,sales))+
geom_point() + stat_smooth(method = lm, se = 0) + theme_bw()
marketing_model <- lm(sales ~ budget, marketing_df)
Next, a very important step before we can start using the model we built in the last line is checking the assumptions of our chosen technique, simple linear regression. R makes it very easy with the built-in plot function, which gives us the four plots we need. Read more about the assumptions here.
par(mfrow = c(2, 2)) #show plots next to each other 2x2
plot(marketing_model)
Based on the first plot, we can declare that there is indeed a linear relationship between the two variables, because the residuals are evenly distributed around subsequent fitted values. You can easily examine that by looking at the red line: if it sticks to 0, there is no problem. Due to fewer observations towards the higher sales values, the relationship shifts slightly.
The second plot shows us quite clearly that the vast majority of residuals are normally distributed.
The third plot shows no clear pattern in the variance, which is desirable. However, it is not the ideal situation, in which the red line would be entirely horizontal, indicating equal spread of variance along the values of \(y\). The fourth plot shows influential points; there is no reason to assume that any outliers need to be removed, especially since linear regression is a relatively robust method. You can read more here. The last assumption of the regression is independence of residuals; we can assume that the entries in this dataset are independent of one another if they come from different companies. However, the documentation of the dataset does not shed light on this.
Now that we’ve got the assumptions out of the way and they all seem to be satisfied, we can examine our model, its significance, and how much of the variance it explains.
summary(marketing_model)
##
## Call:
## lm(formula = sales ~ budget, data = marketing_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.6655 -1.5685 0.1408 1.9153 8.6274
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.091634 0.526230 9.676 <2e-16 ***
## budget 0.048688 0.001982 24.564 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.12 on 198 degrees of freedom
## Multiple R-squared: 0.7529, Adjusted R-squared: 0.7517
## F-statistic: 603.4 on 1 and 198 DF, p-value: < 2.2e-16
Additionally, let’s look at the mathematical representation of the model: \[\begin{equation} \hat{Sales}_i = 5.0916 + 0.0487 {Budget}_i + \hat{\epsilon}_i \end{equation}\] Looking at the p-values of the coefficients (all under 0.001) and the R-squared, we can confidently reject the hypothesis that advertising budget has no effect on sales. Our model’s R-squared shows that the budget explains around 75% of the variation, which is remarkably high. We can read the model’s budget coefficient as ‘for each 1000 USD spent on advertising, approximately 48.7 USD returns in the form of sales’. That’s not terribly efficient, but maybe the companies are in the process of building a customer base and therefore spend a lot on creating awareness. Remember that there is also the intercept, which should not be interpreted on its own, but will add 5000 USD to any predicted sales value. This model is of course not perfect, and there are other possible pitfalls, such as omitted variable bias.
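As a sketch of putting the coefficients to use, here is a prediction for a hypothetical company whose total budget is 300 (in the dataset’s units), computed by hand from the estimates above; predict(marketing_model, data.frame(budget = 300)) would give the same answer at full precision.

```r
b0 <- 5.091634 # intercept estimate from the summary above
b1 <- 0.048688 # budget coefficient estimate

predicted_sales <- b0 + b1 * 300 # hypothetical total budget of 300
predicted_sales # 19.698034
```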
Additionally, the summary of the model output in R is explained in StatQuest’s short video.
Let’s say your boss looks at the report of your results from the previous section and e-mails you that he’d like to know which part of the media mix is most efficient at impacting sales, based on the data. There, you’re looking at a multiple regression problem, with one dependent variable, the same as before, sales, and three independent variables: the budgets of the three channels your company advertises through, youtube, facebook and newspaper. First, you write out the model: \[\begin{equation} \hat{Sales}_i = \hat{\beta}_0 + \hat{\beta}_1 {youtube}_i + \hat{\beta}_2 {facebook}_i + \hat{\beta}_3 {newspaper}_i + \hat{\epsilon}_i \end{equation}\] Then you dive straight into R.
marketing_model_multiple <- lm(sales ~ youtube + facebook + newspaper, data = marketing_df)
Then a discussion of the assumptions should follow, but since we’ve covered them in detail above, I will only focus on the one assumption that applies to multiple regression and not to a simple linear one: independence of the explanatory variables, or lack of multicollinearity. There are several ways of investigating that; I will use a simple correlation plot, but you may also want to look into the VIF method.
library(corrplot)
corrplot(cor(marketing_df[,c(1:3)]), method = "circle")
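The VIF approach mentioned above can also be computed without extra packages: regress each predictor on the remaining ones and take \(VIF = 1/(1-R^2)\). Below is a minimal sketch on hypothetical stand-in data (the `car` package’s `vif()` function does the same for a fitted model):

```r
# Hypothetical stand-in for marketing_df with one mildly correlated pair
set.seed(2)
marketing_df <- data.frame(youtube  = runif(200, 0, 350),
                           facebook = runif(200, 0, 60))
marketing_df$newspaper <- 0.4 * marketing_df$facebook + runif(200, 0, 90)

# Manual VIF: regress each predictor on the others, VIF = 1 / (1 - R^2)
vif_manual <- function(df) {
  sapply(names(df), function(v) {
    r2 <- summary(lm(reformulate(setdiff(names(df), v), v), data = df))$r.squared
    1 / (1 - r2)
  })
}
vif_manual(marketing_df)  # values near 1 are fine; above ~5 is usually a red flag
```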
Here we see that there is some correlation between the predictors, especially between facebook and newspaper, where it exceeds 0.35. Multicollinearity can inflate standard errors and, in turn, the p-values of the affected variables. It is worth noting, though, that it does not affect the overall model fit, so the R-squared and F-statistic are not distorted. Additionally, as a rule of thumb, correlations between explanatory variables below \(r = 0.9\) should not be worrying, so we can safely proceed. Let’s draw up the summary of the model then:
summary(marketing_model_multiple)
##
## Call:
## lm(formula = sales ~ youtube + facebook + newspaper, data = marketing_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.5932 -1.0690 0.2902 1.4272 3.3951
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.526667 0.374290 9.422 <2e-16 ***
## youtube 0.045765 0.001395 32.809 <2e-16 ***
## facebook 0.188530 0.008611 21.893 <2e-16 ***
## newspaper -0.001037 0.005871 -0.177 0.86
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.023 on 196 degrees of freedom
## Multiple R-squared: 0.8972, Adjusted R-squared: 0.8956
## F-statistic: 570.3 on 3 and 196 DF, p-value: < 2.2e-16
Newspaper is clearly insignificant (p = 0.86), so we can safely drop it and run a restricted version of the model.
summary(lm(sales ~ youtube + facebook, marketing_df))
##
## Call:
## lm(formula = sales ~ youtube + facebook, data = marketing_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.5572 -1.0502 0.2906 1.4049 3.3994
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.50532 0.35339 9.919 <2e-16 ***
## youtube 0.04575 0.00139 32.909 <2e-16 ***
## facebook 0.18799 0.00804 23.382 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.018 on 197 degrees of freedom
## Multiple R-squared: 0.8972, Adjusted R-squared: 0.8962
## F-statistic: 859.6 on 2 and 197 DF, p-value: < 2.2e-16
\[\begin{equation} \hat{Sales}_i = 3.5053 + 0.0458 {youtube}_i + 0.1880 {facebook}_i + \hat{\epsilon}_i \end{equation}\]
We can quickly determine that this model is better at capturing the variation in sales than the previous simple linear one, because of the higher R-squared, lower residual standard error and higher F-statistic. The coefficients on the independent variables can be interpreted as follows: ‘for each 1000 USD spent on YouTube advertising, sales rise by approximately 45.8 USD; for Facebook adverts, sales rise by approximately 188 USD’. Now you can present the results to your boss: Facebook is the most efficient medium, while the hypothesis that the newspaper budget has no influence on sales could not be rejected. You might also want to add that there are possibly other variables in play that you don’t have access to, so he should ask other people in the company for input before pulling all the money from the local newspaper to invest it in Facebook ads.
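The choice between the restricted and the full model can also be settled with a formal F-test via `anova()`. A sketch on hypothetical stand-in data, in which newspaper truly has no effect, so the test should typically favour the restricted model:

```r
# Hypothetical stand-in for marketing_df; newspaper has no true effect on sales
set.seed(3)
marketing_df <- data.frame(youtube   = runif(200, 0, 350),
                           facebook  = runif(200, 0, 60),
                           newspaper = runif(200, 0, 120))
marketing_df$sales <- 3.5 + 0.046 * marketing_df$youtube +
  0.188 * marketing_df$facebook + rnorm(200, sd = 2)

full       <- lm(sales ~ youtube + facebook + newspaper, data = marketing_df)
restricted <- lm(sales ~ youtube + facebook, data = marketing_df)

# H0: the dropped coefficient equals zero; a large p-value supports dropping it
anova(restricted, full)
```

For a single dropped variable this F-test is equivalent to the t-test on that coefficient in the full model’s summary.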
If you’d like to watch a video on a different multiple regression example in R, I recommend the one below.
Additionally, if you’re interested in using R for modeling with a much broader set of tools, I refer you, as always, to Hadley’s book.
In this section we take a step back from modeling and analyze the distribution of our data. Before, we largely assumed that it is normally distributed; now we are going to investigate whether that assumption was warranted.
One of the best ways to see whether some univariate data (one variable/column) matches a certain distribution is to draw a quantile-quantile (Q-Q) plot. If you’d like to brush up on what a quantile is, you can do so here. A Q-Q plot puts the quantiles of your sample on the y axis and the theoretical quantiles, for example those of the normal distribution, on the x axis. If the plotted quantiles (the ‘circles’ on the graph) fall on a straight line, the variable approximately follows that theoretical distribution. In this part we’re going to focus on checking whether our distributions are normal.
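As a quick refresher in code: base R’s `quantile()` returns the cut points below which a given share of the sample falls, and these are exactly the values a Q-Q plot compares against their theoretical counterparts from `qnorm()`:

```r
set.seed(4)
x <- rnorm(1000)  # a standard normal sample

# Sample quartiles: the 25th, 50th and 75th percentiles
quantile(x, probs = c(0.25, 0.5, 0.75))

# Their theoretical counterparts under the standard normal
qnorm(c(0.25, 0.5, 0.75))  # approx. -0.674, 0, 0.674
```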
Consider the marketing data from the previous sections. Suppose we wanted to check whether the mean budgets for the newspaper and facebook channels are the same. To check that at the 95% confidence level, we want to perform a t-test. One of its assumptions is that the variables are normally distributed. Let’s use a Q-Q plot to verify that visually.
par(mfrow = c(1,2))
qqnorm(marketing$facebook)
qqline(marketing$facebook, col = 2)
qqnorm(marketing$newspaper)
qqline(marketing$newspaper, col = 2)
Based on these plots, we can see that most of the values (quantiles) of the variables do follow a normal distribution, except for the lowest and highest ones. To the naked eye it looks as though the assumption is mostly correct. However, R is statistical software and we’re always talking about tests, so is it possible to test for normality to be sure (at the 95% level, of course)?
Yes indeed, there is a couple of such tests, but the most popular one is the Shapiro-Wilk test. In R, we can easily run it using the shapiro.test function from the built-in stats package. I hope apologies to Wilk have been made.
shapiro.test(marketing$facebook)
##
## Shapiro-Wilk normality test
##
## data: marketing$facebook
## W = 0.94401, p-value = 5.198e-07
shapiro.test(marketing$newspaper)
##
## Shapiro-Wilk normality test
##
## data: marketing$newspaper
## W = 0.9364, p-value = 1.127e-07
The test results show that the null hypothesis, ‘the data comes from a normal distribution’, is rejected for both variables. Therefore, we technically cannot assume normality based on the Shapiro-Wilk test, even though it looked promising on the Q-Q plot. In both distributions we can observe what are called ‘heavy tails’, i.e. more extreme points than we’d expect in a normal distribution. Read more about interpreting Q-Q plots here.
Lastly, be wary that there is an argument for normality testing being ‘essentially useless’, especially for bigger sample sizes.
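That argument is easy to illustrate: with a large sample, Shapiro-Wilk tends to flag even tiny departures from normality that would never show on a Q-Q plot, while with a small sample it has little power to detect real ones. A self-contained sketch using the t-distribution, which is nearly but not exactly normal:

```r
set.seed(5)
nearly_normal <- rt(5000, df = 15)  # close to normal, slightly heavier tails

# With only 30 observations the test usually cannot tell the difference...
shapiro.test(nearly_normal[1:30])$p.value

# ...while with n = 5000 (the function's maximum) it often rejects anyway
shapiro.test(nearly_normal)$p.value
```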
In this section we will introduce methods of making our data more normal and some non-parametric tests.
There are several ways to transform data to make it fit the normal distribution better. The easiest, most straightforward ones are the logarithm, the square root and the reciprocal. Incidentally, these are also the most widespread. Read more about ‘simple’ transformations here.
It’s easy to implement those transformations in R. Here we will also visualise Q-Q plots to assess whether we’ve improved the distribution of facebook from the previous examples.
par(mfrow = c(1,3))
qqnorm(log10(marketing$facebook+1), main = "Q-Q (log transformation)")
qqline(log10(marketing$facebook+1), col = 2)
qqnorm(sqrt(marketing$facebook), main = "Q-Q (square-root transformation)")
qqline(sqrt(marketing$facebook), col = 2)
qqnorm(1/(marketing$facebook+1), main = "Q-Q (reciprocal transformation)")
qqline(1/(marketing$facebook+1), col = 2)
As we can see on the Q-Q plots, none of the transformations improved the distribution’s likeness to the Gaussian curve. Therefore, we might look to some more complex transformations, such as the Box-Cox transformation, the Yeo-Johnson transformation and others. Those methods are more difficult to interpret and implement. I recommend reading this CRAN vignette about the bestNormalize package, designed for picking the best method, and using the methods from the MASS package.
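As a taste of the MASS route, here is a minimal Box-Cox sketch on synthetic skewed data (MASS ships with R; the bestNormalize package automates a similar search over more transformation families):

```r
library(MASS)  # ships with R; provides boxcox()

set.seed(6)
x <- rexp(200, rate = 0.1)  # a clearly right-skewed, strictly positive variable

# Profile the Box-Cox log-likelihood over a grid of lambda values
bc <- boxcox(lm(x ~ 1), lambda = seq(-1, 1, 0.05), plotit = FALSE)
best_lambda <- bc$x[which.max(bc$y)]
best_lambda  # for exponential-like data this tends to land near the cube root

# Apply the chosen power transformation (log when lambda is 0)
x_t <- if (abs(best_lambda) < 1e-8) log(x) else (x^best_lambda - 1) / best_lambda
```

Box-Cox requires strictly positive data; Yeo-Johnson relaxes that restriction.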
Another approach to non-normally distributed data is changing the statistical testing method rather than trying to ‘normalize’ data which clearly isn’t normal. Such statistical methods are called non-parametric, because they do not estimate parameters under an assumed distribution.
Some of the most popular tests you might come across are: Mann-Whitney U, Wilcoxon Signed Rank, Kruskal Wallis, and Friedman tests. For an R guideline, click here.
In our example, what we wanted to do in the first place was compare the means of two numeric variables, facebook and newspaper, using a t-test. We’ve since found out that our variables are not normally distributed, and the basic transformations we tried did not improve the fit. Therefore, we will use a non-parametric alternative, the independent two-group Mann-Whitney U test, also called the Wilcoxon rank-sum test. Its only assumption is that our data is independent and randomly drawn. As we discussed earlier, this is true if our data covers different, randomly drawn companies and not the same one over time. We have no reason to assume otherwise, although the documentation is not clear.
wilcox.test(marketing$newspaper,marketing$facebook, conf.int = TRUE)
##
## Wilcoxon rank sum test with continuity correction
##
## data: marketing$newspaper and marketing$facebook
## W = 23362, p-value = 0.003638
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
## 2.039959 10.440034
## sample estimates:
## difference in location
## 6.12001
The p-value is below 0.05, so we reject the null hypothesis that there is no location shift between the budgets of the two media. In fact, since we passed newspaper as the first argument, the positive estimate tells us that newspaper budgets are higher than facebook budgets, by roughly 2,000 to 10,000 USD (remember that our data is in thousands), at the 95% confidence level.
For a video featuring a non-parametric test for different data in R, check out the one below.
It seems that we have already used data visualisation extensively for other tasks without ever really talking about it. So, I will assume that you already have some practical experience with graphing, but need a general explanation to start creating your own plots without problems.
The tool we are going to use is ggplot2 by Hadley Wickham, which is based on the Grammar of Graphics developed by Leland Wilkinson. In it, each individual component of a plot is organized into a layer, and that is how ggplots are usually built: a base ggplot call, plus (+) added layers that put geometric objects on the graph, the geoms (e.g. geom_point, which adds scattered points). Below I give you the general recipe for a ggplot.
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(
mapping = aes(<MAPPINGS>),
stat = <STAT>,
position = <POSITION>) +
<COORDINATE_FUNCTION> +
<FACET_FUNCTION> +
<THEME>
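To make the recipe concrete, here is a minimal instance of it using ggplot2’s built-in mpg dataset (stat and position are left at the geom’s defaults, which is the common case):

```r
library(ggplot2)

ggplot(data = mpg) +                                # <DATA>
  geom_point(mapping = aes(x = displ, y = hwy,      # <GEOM_FUNCTION> with <MAPPINGS>
                           colour = class)) +
  coord_cartesian(ylim = c(10, 45)) +               # <COORDINATE_FUNCTION>
  facet_wrap(~ drv) +                               # <FACET_FUNCTION>
  theme_minimal()                                   # <THEME>
```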
No one can explain ggplot better than its creator, so I really encourage you to read the chapter in the book by, you guessed it, Hadley Wickham. Jump to chapter 3 here.
Another brilliant reading is “Data visualization: A practical introduction” by Kieran Healy which you might know from lectures. Assuming that you’re already familiar with the basics of how to get around in R, I’d suggest starting at Chapter 3 and going as far as possible.
Not instead, but as an addition, before or after reading, you can check out a Data Science Dojo webinar on ggplot. The speaker analyzes the famous Titanic dataset. It features bar, box and density plots alongside histograms. It also covers faceting, proper themes and labelling. I embedded the webinar below. It may seem a tad long, but the proper R section begins around the 26th minute, and 1.25 speed is encouraged. The R file for the session is here and the csv can be downloaded here.
If you want to see examples of some more complicated plots, I recommend checking out these resources from the University of Cincinnati. The first introductory section is somewhat redundant with the book, but the others present some very cool and advanced ideas for visualising many variables at once.
It can be overwhelming to look at all these plots and webinar presenters who seem to know the grammar of graphics like the grammar of English, or better. Fear not: nobody is required to remember any of this by heart, and many don’t. Cheatsheets are not only allowed, but encouraged.
ggplot2 is incredibly powerful, and once you master it, truly beautiful things happen. I leave you with a visualisation of the diamonds dataset, with four variables at once, and yet quite readable. Color is graded alphabetically from D (best) to J (worst). The facets refer to the cut quality.
ggplot(diamonds,
aes(x = carat,
y = price)) +
geom_point(aes(colour = color),
alpha = 0.5,
size = 0.5) +
scale_color_brewer(palette = "PRGn") +
geom_smooth(se = FALSE,
linetype = "dashed",
colour = "black",
size = 0.4) +
facet_wrap(~cut) +
theme_minimal() +
labs(y = "price (USD)")
TBD
TBD
TBD